Graham, Yvette, Timothy Baldwin, Alistair Moffat and Justin Zobel (to appear) Can Machine Translation Systems be Evaluated by the Crowd Alone? Natural Language Engineering
Abstract
Crowd-sourced assessments of machine translation quality allow evaluations to be carried out cheaply and on a large scale. It is essential, however, that the crowd’s work be filtered to avoid contamination of results through the inclusion of false assessments. One method is to filter via agreement with experts, but even amongst experts agreement levels may not be high. In this paper, we present a new methodology for crowd-sourcing human assessments of translation quality, which allows individual workers to develop their own individual assessment strategy. Agreement with experts is no longer required, and a worker is deemed reliable if they are consistent relative to their own previous work. Individual translations are assessed in isolation from all others in the form of direct estimates of translation quality. This allows more meaningful statistics to be computed for systems and enables significance to be determined on smaller sets of assessments. We demonstrate the methodology’s feasibility in large-scale human evaluation through replication of the human evaluation component of the WMT shared translation task for two language pairs, Spanish-to-English and English-to-Spanish. Results for measurement based solely on crowd-sourced assessments show system rankings in line with those of the original evaluation. Comparison of results produced by the relative preference approach and the direct estimate method described here demonstrates that the direct estimate method has a substantially increased ability to identify significant differences between translation systems.
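Although the paper gives the full details, a minimal sketch may help illustrate what direct estimates enable computationally: raw scores are standardized per worker, averaged per system, and any two systems can then be compared with a standard significance test. The data layout, the per-worker z-scoring, and the choice of Welch's t-test below are illustrative assumptions, not the authors' exact procedure.

```python
# Illustrative sketch only: per-worker standardization of direct-assessment
# scores, system-level means, and a pairwise significance test. The data
# layout and the choice of Welch's t-test are assumptions for illustration.
from collections import defaultdict
from statistics import mean, stdev
from scipy.stats import ttest_ind

# Each assessment: (worker_id, system_id, raw score on a 0-100 scale)
assessments = [
    ("w1", "sysA", 78), ("w1", "sysB", 64), ("w1", "sysA", 90),
    ("w2", "sysA", 55), ("w2", "sysB", 40), ("w2", "sysB", 35),
    ("w1", "sysB", 70), ("w2", "sysA", 60),
]

# Standardize each worker's scores (z-scores) so that workers who use the
# rating scale differently still contribute comparable numbers.
by_worker = defaultdict(list)
for worker, _, score in assessments:
    by_worker[worker].append(score)

def z(worker, score):
    scores = by_worker[worker]
    sd = stdev(scores) if len(scores) > 1 else 1.0
    return (score - mean(scores)) / (sd or 1.0)

by_system = defaultdict(list)
for worker, system, score in assessments:
    by_system[system].append(z(worker, score))

for system, scores in by_system.items():
    print(system, round(mean(scores), 3))

# Unpaired significance test between two systems' standardized scores.
t, p = ttest_ind(by_system["sysA"], by_system["sysB"], equal_var=False)
print(f"t = {t:.3f}, p = {p:.3f}")
```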
Similar articles
Crowd-Sourcing of Human Judgments of Machine Translation Fluency
Human evaluation of machine translation quality is a key element in the development of machine translation systems, as automatic metrics are validated through correlation with human judgment. However, achievement of consistent human judgments of machine translation is not easy, with decreasing levels of consistency reported in annual evaluation campaigns. In this paper we describe experiences g...
Is Machine Translation Getting Better over Time?
Recent human evaluation of machine translation has focused on relative preference judgments of translation quality, making it difficult to track longitudinal improvements over time. We carry out a large-scale crowd-sourcing experiment to estimate the degree to which state-of-the-art performance in machine translation has increased over the past five years. To facilitate longitudinal evaluation, ...
Continuous Measurement Scales in Human Evaluation of Machine Translation
We explore the use of continuous rating scales for human evaluation in the context of machine translation evaluation, comparing two assessor-intrinsic quality-control techniques that do not rely on agreement with expert judgments. Experiments employing Amazon’s Mechanical Turk service show that quality-control techniques made possible by the use of the continuous scale show dramatic improvements...
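As a rough illustration of an assessor-intrinsic quality-control check of the kind referred to above, one option is to show a worker some translations both intact and deliberately degraded, and to keep the worker only if they score the intact versions significantly higher. The degradation setup, the paired Wilcoxon test, and the 0.05 threshold in this sketch are assumptions, not the paper's specification.

```python
# Illustrative sketch of an assessor-intrinsic reliability check: a worker is
# shown some translations twice, once intact and once deliberately degraded,
# and is kept only if they score the intact versions significantly higher.
# The paired Wilcoxon test and the 0.05 threshold are assumptions here.
from scipy.stats import wilcoxon

def worker_is_reliable(intact_scores, degraded_scores, alpha=0.05):
    """intact_scores[i] and degraded_scores[i] are the worker's scores for
    the same underlying translation, with and without degradation."""
    stat, p = wilcoxon(intact_scores, degraded_scores, alternative="greater")
    return p < alpha

intact   = [72, 80, 65, 90, 55, 70, 84, 61, 77, 69]
degraded = [50, 62, 60, 71, 30, 52, 66, 40, 58, 49]
print(worker_is_reliable(intact, degraded))  # True for a consistent worker
```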
Measurement of Progress in Machine Translation
Machine translation (MT) systems can only be improved if their performance can be reliably measured and compared. However, measurement of the quality of MT output is not straightforward, and, as we discuss in this paper, relies on correlation with inconsistent human judgments. Even when the question is captured via “is translation A better than translation B” pairwise comparisons, empirical evi...
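To make the notion of inconsistent pairwise judgments concrete, agreement between annotators on "is translation A better than translation B" questions is commonly summarised with Cohen's kappa; the toy data and hand-rolled computation below are purely illustrative, not taken from the paper.

```python
# Illustrative sketch: quantifying consistency of pairwise "A better than B"
# judgments between two annotators with Cohen's kappa. Labels and data are
# invented for illustration.
from collections import Counter

# Judgments on the same translation pairs: "A" (A better), "B" (B better), "=" (tie)
annotator1 = ["A", "B", "A", "=", "B", "A", "A", "=", "B", "A"]
annotator2 = ["A", "B", "B", "=", "B", "A", "=", "A", "B", "A"]

def cohen_kappa(y1, y2):
    n = len(y1)
    observed = sum(a == b for a, b in zip(y1, y2)) / n
    c1, c2 = Counter(y1), Counter(y2)
    expected = sum(c1[l] * c2[l] for l in set(y1) | set(y2)) / (n * n)
    return (observed - expected) / (1 - expected)

print(round(cohen_kappa(annotator1, annotator2), 3))
```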
Graham, Yvette and Timothy Baldwin (to appear) Testing for Significance of Increased Correlation with Human Judgment, In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP 2014), Doha, Qatar
Automatic metrics are widely used in machine translation as a substitute for human assessment. With the introduction of any new metric comes the question of just how well that metric mimics human assessment of translation quality. This is often measured by correlation with human judgment. Significance tests are generally not used to establish whether improvements over existing methods such as B...
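The snippet above does not name the test, but one standard choice for asking whether one metric correlates significantly better with human judgment than another, over the same data, is Williams' test for dependent correlations; treating it as the intended method here is an assumption. A sketch:

```python
# Illustrative sketch: testing whether metric 1 correlates significantly more
# strongly with human judgment than metric 2 does, when both correlations are
# computed over the same data points. Uses Williams' test for dependent
# correlations; its relevance to the paper above is an assumption.
from math import sqrt
from scipy.stats import t as t_dist

def williams_test(r12, r13, r23, n):
    """r12: corr(human, metric1); r13: corr(human, metric2);
    r23: corr(metric1, metric2); n: number of data points."""
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    rbar = (r12 + r13) / 2
    t_stat = (r12 - r13) * sqrt(
        (n - 1) * (1 + r23)
        / (2 * k * (n - 1) / (n - 3) + rbar**2 * (1 - r23) ** 3)
    )
    # One-sided p-value for the hypothesis that metric 1 correlates better.
    return 1 - t_dist.cdf(t_stat, df=n - 3)

print(round(williams_test(r12=0.93, r13=0.87, r23=0.85, n=20), 4))
```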